Optical character recognition method
Patent abstract:
The optical character recognition method applies a first OCR engine to provide character identification of at least a first type of characters and areas of at least a second type of characters in the character string image. A second OCR engine is applied to the areas of the at least second type of characters to provide character identification of the second type of characters. The characters identified by the first OCR engine and the second OCR engine are combined in a further step to obtain the character identification of the character string image.
Publication number: BE1022562B1
Application number: E2015/5354
Filing date: 2015-06-09
Publication date: 2016-06-02
Inventors: Frédéric Collet; Jordi Hautot; Michel Dauw; Muelenaere Pierre De; Olivier Dupont; Günter Hensges
Applicant: I.R.I.S. S.A.
IPC main class:
Patent description:
The invention relates to a method for optical character recognition, and more specifically to an optical character recognition method for recognizing more than one type of characters.

Prior art of the invention

Optical Character Recognition (OCR) methods convert the image of a text into machine-readable code using a character recognition method to identify the characters represented on the image. Optical character recognition methods start with an image comprising a character string and provide, with an OCR engine, an identification (ID) of the characters present in the character string, i.e., an identification of the characters in machine-readable code, to obtain a searchable string of characters. There are many OCR engines. They must work quickly with limited computing resources and accurately recognize the characters. Speed, limited resources and accuracy are conflicting requirements and, in practice, a good OCR engine is based on trade-offs between these features.

An OCR engine designed for Latin character recognition (e.g. English) is different from an engine designed for Asian (Chinese, Japanese and Korean) character recognition or for Arabic characters. For example, the identification database is different, although some characters such as punctuation and numeric characters may be present in more than one database. The Latin character database can contain fewer than 100 characters, while the Asian character database can contain approximately 5000 characters per language. Therefore, an OCR engine designed for Asian characters typically requires more memory than an OCR engine designed for Latin characters. Because of this great difference in the number of characters, the algorithms that must take the diversity of characters into account are optimized differently. The features used for character recognition are also different, because Latin character shapes are simpler than Asian character shapes, which can contain many features, but Latin character shapes are more variable because of the high number of Latin fonts. In addition, the contextual decision algorithms that make the final decision about character identification using linguistic and typographic models are different. Linguistic models for Latin languages use in particular a language dictionary with the probability of word occurrence, whereas linguistic models for Asian languages use in particular n-grams of characters with probabilities of occurrence. An n-gram of characters is a sequence of n consecutive characters. Another reason why OCR engines are different for Latin characters and Asian characters is that there are no spaces between words in Chinese or Japanese texts.

All in all, the use of a known OCR engine for multiple types of characters, such as Latin and Asian characters, does not give the desired result of being accurate, fast, and undemanding in terms of computing resources. That is why known OCR engines are typically designed for a given type of characters, and if a known OCR engine includes the ability to recognize characters of another type, its accuracy in recognizing this other type of characters is typically low. This lack of accuracy is particularly problematic because many documents today contain a mixture of different types of characters: for example, a Japanese invoice or purchase order contains Japanese text, but also English names, English postal addresses, e-mail addresses, amounts in figures, and so on.

Summary of the invention
An object of this invention is to provide a character identification method that is fast and accurate in identifying the characters in a character string image. Another object of this invention is to provide a computer program product for implementing said character identification method. These objects are achieved according to the invention as described in the independent claims.

In a first aspect, the present invention provides a method for identifying characters in a character string image, the method comprising:
(i) applying a first OCR engine to provide an identification of characters of at least a first type of characters and areas of at least a second type of characters in the character string image,
(ii) applying, to the areas of the at least second type of characters, a second OCR engine to provide an identification of characters of a second type of characters, and
(iii) combining the characters identified by the first OCR engine and the second OCR engine to obtain the identification of the characters of the character string image,
wherein the first OCR engine includes a segmentation of the character string image into segmentation portions and includes, for each segmentation portion, the steps of:
(a) applying a first character classification to provide a first plurality of hypotheses on at least one character represented by the segmentation portion and a first plurality of probabilities associated with the hypotheses of the first plurality of hypotheses,
(b) verifying whether the first plurality of hypotheses satisfies at least one condition, and
(c) applying, if at least one condition is satisfied, a second character classification to provide a second plurality of hypotheses on the at least one character represented by the segmentation portion and a second plurality of probabilities associated with the hypotheses of the second plurality of hypotheses.

In such a method, the characters of the first type are directly analyzed by the first character classification in the first OCR engine, so their processing is fast and accurate. Only the characters for which a doubt exists after the first character classification in the first OCR engine are analyzed by a second character classification in the first OCR engine, the doubt being evaluated by the check at step (b). This selection of the characters to be analyzed by the second character classification makes the process particularly fast. A second OCR engine is then used only in areas where a type of characters other than the first type of characters has been detected, to increase the accuracy of the identification of the characters of the second type. The fact that the second OCR engine is only used on areas where a character type other than the first type of characters has been detected means that the second OCR engine is used only when it is needed. For example, the second OCR engine is not used at all in a text comprising only characters of the first type, but if a text also contains characters of the second type of characters in some places, such as an e-mail address in a Chinese invoice, the accuracy of their identification is high. All in all, the characters of the second type of characters are analyzed twice during this process, at two different levels (character classification and complete OCR engine), which gives excellent accuracy.

In one embodiment of the invention, the first OCR engine uses a character database comprising characters of the first type of characters.
The first OCR engine identifies characters of the first type of characters, and this identification is based on the recognition of character shapes by comparison with character models present in the database. In one embodiment of the invention, the first character classification uses a character database comprising characters of the first type of characters. The purpose of the first character classification is to classify the characters of the first type as early as possible in the identification process, because the smaller the number of steps to which these characters are subjected, the faster the identification process is.

In an embodiment according to the invention, the first character classification is capable of detecting characters of at least one type of characters other than the first type of characters. This is particularly interesting because the detection of another type of characters in a segmentation portion is a strong indicator that a second character classification, designed for another type of characters, will increase the accuracy of the identification process.

In one embodiment of the invention, the first type of characters is the Asian type of characters. In one embodiment of the invention, the second OCR engine uses a character database comprising characters of the second type of characters. The second OCR engine identifies characters of the second type of characters, and this identification is based on the recognition of character shapes by comparison with character models present in the database. In an embodiment according to the invention, the second character classification uses a character database comprising characters of the second type of characters. The purpose of the second character classification is to classify the characters of the second type, which provides hypotheses about their identification associated with probabilities. It is then possible to choose, in a subsequent step of the first OCR engine using contextual decisions, an identification of the first or second type.

In an embodiment of the invention, the second type of characters is the Latin type of characters. In an embodiment according to the invention, the first type of characters is the Latin type of characters, the Arabic type of characters, the Hebrew type of characters, the Cyrillic type of characters, the Greek type of characters or the hieroglyph type of characters. The method can be applied to any type of characters, including those mentioned here. In an embodiment according to the invention, the second type of characters is the Asian type of characters, the Arabic type of characters, the Hebrew type of characters, the Cyrillic type of characters, the Greek type of characters or the hieroglyph type of characters. The method can be applied to any type of characters, including those mentioned here.

In an embodiment according to the invention, at least one of the first character classification and the second character classification is a classification of individual characters.
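The overall flow of the first aspect can be pictured with a short sketch. The following Python fragment is a minimal illustration only, written under the assumption of hypothetical helper objects first_ocr_engine and second_ocr_engine and a position attribute on the returned identifications; it is not the implementation of the invention itself.

def recognize(character_string_image, first_ocr_engine, second_ocr_engine):
    """Hedged sketch of the two-engine OCR method described above.

    first_ocr_engine  -- assumed to return (ids_of_first_type, areas_of_second_type)
    second_ocr_engine -- assumed to return identifications for one image area
    """
    # (i) First OCR engine: identifies characters of the first type and
    #     locates areas that probably contain the second type of characters.
    first_type_ids, second_type_areas = first_ocr_engine.run(character_string_image)

    # (ii) Second OCR engine: applied only to the detected areas.
    second_type_ids = []
    for area in second_type_areas:
        second_type_ids.extend(second_ocr_engine.run(area))

    # (iii) Combine both results in reading order to obtain the identification
    #       of all the characters of the character string image.
    all_ids = first_type_ids + second_type_ids
    return sorted(all_ids, key=lambda ident: ident.position)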
In an embodiment according to the invention, the areas of the at least second type of characters are groups of segmentation portions in which certain segmentation portions satisfy at least one of the following conditions:
• all the probabilities of the hypotheses provided by the first character classification for said segmentation portion are below a given threshold;
• a hypothesis among the hypotheses for said segmentation portion concerns a character of the first type of characters known to resemble a character of the at least second type of characters;
• a hypothesis among the hypotheses for said segmentation portion relates to a type of characters other than the first type of characters; and
• a character of the second type of characters has been identified on said segmentation portion by the first OCR engine.
For the identification to be fast, it is important that the second OCR engine be applied only to text areas where characters other than the first type of characters are likely to be present. If an area meets at least one of the criteria listed here, there is a good chance that characters other than the first type of characters are present.

In an embodiment according to the invention, said at least one condition is one of the following conditions:
• all the probabilities of the hypotheses of the first character classification are below a given threshold;
• a hypothesis among the hypotheses of the first character classification relates to a character of the first type of characters known to resemble a character of another type of characters; and
• a hypothesis among the hypotheses of said first character classification relates to a type of characters other than the first type of characters.
For the identification to be fast, it is important that the second OCR engine be applied only to segmentation portions where characters other than the first type of characters are likely to be present. If a segmentation portion fulfils at least one of the criteria listed here, there is a good chance that characters other than the first type of characters are present.

In one embodiment of the invention, the segmentation of the character string image into segmentation portions comprises the steps of:
• determining a first starting point coordinate of a pixel contrasting with the background;
• generating a list of potential character widths depending on a maximum character width and on characteristics of the part of the character string image corresponding to the maximum character width; and
• determining a second part of the character string image corresponding to the first starting point coordinate and to the first width on the list of potential character widths.
A segmentation method based on the width of the characters like this one is particularly effective for Asian texts, where the characters are not grouped into words.

In an embodiment according to the invention, the first plurality of probabilities corresponds to a first probability scale, the second plurality of probabilities corresponds to a second probability scale, and the method comprises a step of transforming at least one of the first plurality of probabilities and the second plurality of probabilities to scale the first or second probability scale, so that the first plurality of probabilities and the second plurality of probabilities can be compared, to obtain a first or second plurality of transformed probabilities. A problem can arise when the probabilities provided by the two classifications are not on the same scale.
A step of transforming one of the pluralities of probabilities is thus necessary in order to be able to consider them in the same way during a subsequent contextual decision step.

In one embodiment of the invention, the first OCR engine further comprises a step of making a contextual decision for the identification of characters of the at least first type of characters based on the hypotheses of the first character classification with their corresponding probabilities and on the hypotheses of the second character classification with their corresponding probabilities, for all segmentation portions. A large number of hypotheses can be generated by the character classifications over all the segmentation portions of the character string image. The contextual decision determines, based on the probabilities of the character identification hypotheses generated by the first character classification and, if the second character classification has been used, by the second character classification, and on the basis of the character context, the character identification that is the result of the first OCR engine.

In one embodiment of the invention, the step of making a contextual decision uses at least one decision support tool among a decision graph, a linguistic model and a typographic model. It was found that a contextual decision using one or more such decision support tools is particularly fast and accurate in OCR. For example, looking for the shortest path in a decision graph makes it possible to take into account, in the decision identifying the characters of the entire image, the probabilities generated by the character classifications and the probabilities generated by the application of linguistic models and typographic models.

In a second aspect, the present invention provides a computer program product comprising a medium usable by a computer and in which control logic is stored for causing a computing device to identify characters in a character string image, the control logic comprising:
(i) first computer-readable program code means for applying a first OCR engine to provide an identification of characters of at least a first type of characters and areas of at least a second type of characters in the character string image,
(ii) second computer-readable program code means for applying, to the areas of the at least second type of characters, a second OCR engine to provide an identification of characters of a second type of characters, and
(iii) third computer-readable program code means for combining the characters identified by the first OCR engine and the second OCR engine to obtain the identification of the characters of the character string image,
wherein the first OCR engine includes a segmentation of the character string image into segmentation portions and includes, for each segmentation portion, the steps of:
(a) applying a first character classification to provide a first plurality of hypotheses on at least one character represented by the segmentation portion and a first plurality of probabilities associated with the hypotheses,
(b) verifying whether the first plurality of hypotheses satisfies at least one condition, and
(c) applying, if at least one condition is satisfied, a second character classification to provide a second plurality of hypotheses on the at least one character represented by the segmentation portion and a second plurality of probabilities associated with the hypotheses.
Such a computer program product enables the invention to be applied very efficiently to provide an identification, in a machine-readable code, of the characters represented by the character string image.

In a third aspect, the present invention provides a method for identifying characters in a character string image, the method comprising:
(i) applying an OCR engine designed for Asian characters to provide an identification of Asian characters and areas of non-Asian characters in the character string image,
(ii) applying, to the areas of non-Asian characters, an OCR engine designed for Latin characters to provide an identification of Latin characters, and
(iii) combining the characters identified by the OCR engine designed for Asian characters and the OCR engine designed for Latin characters to obtain the identification of the Asian and Latin characters of the character string image,
wherein the OCR engine designed for Asian characters includes the steps of:
(A) segmenting the character string image into segmentation portions,
(B) applying, for each segmentation portion, a classification of individual characters designed for Asian characters to provide a first plurality of hypotheses on at least one character represented by the segmentation portion and a first plurality of probabilities associated with the hypotheses, and
(C) making a contextual decision for the identification of at least the Asian characters based on the hypotheses of the classification of individual characters designed for Asian characters and their corresponding probabilities, for all segmentation portions.

The Latin characters are analyzed at the level of the classification of individual characters in this embodiment of the invention. Since only the segmentation portions where a doubt occurs during the classification of individual characters designed for Asian characters are analyzed with the classification of individual characters for Latin characters, the entire process is fast.

In one embodiment of the present invention, step (B) further comprises, for each segmentation portion, the substeps of:
• verifying whether the first plurality of hypotheses meets at least one condition, and
• applying, if at least one condition is satisfied, a classification of individual characters designed for Latin characters to provide a second plurality of hypotheses on the at least one character represented by the segmentation portion and a second plurality of probabilities associated with the hypotheses,
and wherein the contextual decision of step (C) is a contextual decision for the identification of characters based on the hypotheses of the classification of individual characters designed for Asian characters and their corresponding probabilities and on the hypotheses of the classification of individual characters designed for Latin characters and their corresponding probabilities, for all segmentation portions.

The Latin characters are analyzed twice during the application of this embodiment of the invention, at two different levels (individual character classification and complete OCR engine), which gives excellent accuracy. Since only the segmentation portions where a doubt occurs during the classification of individual characters designed for Asian characters are analyzed with the classification of individual characters for Latin characters, and possibly with the complete OCR engine designed for Latin characters, the complete process is fast.
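As an illustration of step (B) with its optional substeps, the sketch below shows one plausible way to decide, for a given segmentation portion, whether the Latin classifier should also be invoked. The threshold value, the helper classifiers, the is_asian test and the set of look-alike Asian characters are assumptions made for the example, not values fixed by the invention.

# Hedged sketch of substep (B): classify a segmentation portion with the Asian
# classifier first, and call the Latin classifier only when a doubt exists.
LOOKALIKE_ASIAN_CHARS = {"工", "北"}   # assumed example set of Asian characters resembling Latin ones
PROBABILITY_THRESHOLD = 0.7            # assumed doubt threshold

def is_asian(character):
    # Assumed helper: treats CJK Unified Ideographs (U+4E00..U+9FFF) as Asian.
    return "\u4e00" <= character <= "\u9fff"

def classify_portion(portion, asian_classifier, latin_classifier):
    # First plurality of hypotheses: (character, probability) pairs in [0, 1].
    asian_hypotheses = asian_classifier(portion)

    doubt = (
        # all probabilities are below the threshold
        all(p < PROBABILITY_THRESHOLD for _, p in asian_hypotheses)
        # or a hypothesis concerns an Asian character known to resemble a Latin character
        or any(c in LOOKALIKE_ASIAN_CHARS for c, _ in asian_hypotheses)
        # or a hypothesis already concerns a non-Asian (e.g. Latin) character
        or any(not is_asian(c) for c, _ in asian_hypotheses)
    )

    # Second plurality of hypotheses, produced only when at least one condition holds.
    latin_hypotheses = latin_classifier(portion) if doubt else []
    return asian_hypotheses, latin_hypotheses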
According to a fourth aspect, the invention provides a computer program product comprising a medium usable by a computer and in which control logic is stored for causing a computing device to identify characters in a character string image, the control logic comprising:
(i) first computer-readable program code means for applying an OCR engine designed for Asian characters to provide an identification of Asian characters and areas of non-Asian characters in the character string image,
(ii) second computer-readable program code means for applying, to the areas of non-Asian characters, an OCR engine designed for Latin characters to provide an identification of Latin characters, and
(iii) third computer-readable program code means for combining the characters identified by the OCR engine designed for Asian characters and the OCR engine designed for Latin characters to obtain the identification of the Asian and Latin characters of the character string image,
wherein the OCR engine designed for Asian characters includes the steps of:
(A) segmenting the character string image into segmentation portions,
(B) applying, for each segmentation portion, a classification of individual characters designed for Asian characters to provide a first plurality of hypotheses on at least one character represented by the segmentation portion and a first plurality of probabilities associated with the hypotheses, and
(C) making a contextual decision for the identification of at least the Asian characters based on the hypotheses of the classification of individual characters designed for Asian characters and their corresponding probabilities, for all segmentation portions.

Such a computer program product enables the invention to be applied very efficiently to provide an identification, in a machine-readable code, of the characters represented by the character string image.

According to a fifth aspect, the invention provides a method for identifying characters in a character string image, the method comprising the steps of:
(A) segmenting the character string image into segmentation portions,
(B) for each segmentation portion, the substeps of:
(a) applying a first character classification to provide a first plurality of hypotheses on at least one character represented by the segmentation portion and a first plurality of probabilities associated with the hypotheses,
(b) verifying whether the first plurality of hypotheses satisfies at least one condition, and
(c) applying, if at least one condition is satisfied, a second character classification to provide a second plurality of hypotheses on the at least one character represented by the segmentation portion and a second plurality of probabilities associated with the hypotheses, and
(C) making a contextual decision for the identification of characters of the at least first type of characters based on the hypotheses of the first character classification with their corresponding probabilities and on the hypotheses of the second character classification with their corresponding probabilities, for all segmentation portions.

In such a method, the characters of the first type are directly analyzed by the first character classification, so their processing is fast and accurate. Only the characters for which a doubt exists after the first character classification are analyzed by a second character classification, the doubt being evaluated by the verification. This selection of the characters to be analyzed by the second character classification makes the process particularly fast.
According to a sixth aspect, the invention provides a computer program product comprising a medium usable by a computer and in which control logic is stored for causing a computing device to identify characters in a character string image, the control logic comprising:
(A) first computer-readable program code means for segmenting the character string image into segmentation portions,
(B) second computer-readable program code means for applying, for each segmentation portion, the substeps of:
(a) applying a first character classification to provide a first plurality of hypotheses on at least one character represented by the segmentation portion and a first plurality of probabilities associated with the hypotheses,
(b) verifying whether the first plurality of hypotheses meets at least one condition, and
(c) applying, if at least one condition is satisfied, a second character classification to provide a second plurality of hypotheses on the at least one character represented by the segmentation portion and a second plurality of probabilities associated with the hypotheses, and
(C) third computer-readable program code means for making a contextual decision for the identification of characters of the at least first type of characters based on the hypotheses of the first character classification with their corresponding probabilities and on the hypotheses of the second character classification with their corresponding probabilities, for all segmentation portions.

Such a computer program product makes it possible to apply the invention very efficiently in order to provide an identification, in a machine-readable code, of the characters represented by the character string image.

In a seventh aspect, the invention provides a method for identifying characters in a character string image, the method comprising:
(i) applying a first OCR engine to provide an identification of characters of at least a first type of characters and areas of at least a second type of characters in the character string image,
(ii) applying, to the areas of the at least second type of characters, a second OCR engine to provide an identification of characters of a second type of characters, and
(iii) combining the characters identified by the first OCR engine and the second OCR engine to obtain the identification of the characters of the character string image.

In such a method, the characters of the first type are directly analyzed by the first OCR engine, so their processing is fast and accurate. A second OCR engine is used only in areas where a type of characters other than the first type of characters has been detected, to increase the accuracy of the identification of the characters of the second type. The fact that the second OCR engine is only used in areas where a type of characters other than the first type of characters has been detected means that the second OCR engine is used only when necessary. For example, the second OCR engine is not used at all in a text comprising only characters of the first type, but if a text also contains characters of the second type, the accuracy of their identification is high.
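One plausible way to form the areas handed to the second OCR engine is to group contiguous segmentation portions that the first OCR engine has flagged. The following sketch is an illustrative assumption about such a grouping strategy (merging contiguous flagged portions into areas of at least two characters); the invention does not prescribe this exact procedure.

# Hedged sketch: build areas of the second type of characters by grouping
# contiguous segmentation portions flagged by the first OCR engine.
def build_second_type_areas(portions, is_flagged, min_characters=2):
    """portions       -- segmentation portions in reading order
    is_flagged     -- predicate telling whether a portion met one of the conditions
                      (low probabilities, look-alike character, non-first-type hypothesis)
    min_characters -- assumed minimum size, so the contextual decision of the
                      second OCR engine can work on at least two characters."""
    areas, current = [], []
    for portion in portions:
        if is_flagged(portion):
            current.append(portion)          # extend the current area
        else:
            if len(current) >= min_characters:
                areas.append(current)        # close the area
            current = []
    if len(current) >= min_characters:
        areas.append(current)
    return areas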
In one embodiment of the invention, the first OCR engine includes at least one of the following: (a) the use of a character database comprising characters of the first type of characters; (b) a segmentation designed for the first type of characters; (c) a character classification designed for the first type of characters; and (d) a contextual decision designed for the first type of characters. In an embodiment according to the invention, the second OCR engine comprises at least one of the following: (a) the use of a character database comprising characters of the second type of characters; (b) a segmentation designed for the second type of characters; (c) a character classification designed for the second type of characters; and (d) a contextual decision designed for the second type of characters. Many parts of an OCR engine can be specially designed for a given type of characters. For example, a segmentation using atoms, where the segmentation portions comprise between one and five atoms, may be especially suitable for Asian characters, while a segmentation based on the detection of inter-character breaks may be especially appropriate for Latin characters. The character classifications used for Latin and Asian characters may be different because they calculate the features differently. The contextual decisions are also different: linguistic models for Latin languages in particular use a language dictionary, while linguistic models for Asian languages in particular use n-grams of characters. All in all, there are many ways to perform OCR specifically for a given type of characters.

According to an eighth aspect, the invention provides a computer program product comprising a medium usable by a computer and in which control logic is stored for causing a computing device to identify characters in a character string image, the control logic comprising:
(i) first computer-readable program code means for applying a first OCR engine to provide an identification of characters of at least a first type of characters and areas of at least a second type of characters in the character string image,
(ii) second computer-readable program code means for applying, to the areas of the at least second type of characters, a second OCR engine to provide an identification of characters of a second type of characters, and
(iii) third computer-readable program code means for combining the characters identified by the first OCR engine and the second OCR engine to obtain the identification of the characters of the character string image.

Such a computer program product makes it possible to apply the invention very efficiently in order to provide an identification, in a machine-readable code, of the characters represented by the character string image.

Brief description of the drawings

For a better understanding of the present invention, reference will now be made, by way of example only, to the accompanying drawings, in which:
Figure 1 shows a flowchart of an OCR method according to the state of the art.
Figure 2 shows a flow diagram of an OCR engine according to the state of the art.
Figure 3 shows a flowchart of a combination step combining segmentation and classification of individual characters according to one embodiment of the invention.
Figure 4 illustrates the resemblance between certain Asian and Latin characters, the resemblance being used in one embodiment of the present invention.
Figure 5 illustrates a combined segmentation and classification of individual characters according to an embodiment of the present invention.
Figure 6 shows a flowchart of an OCR method according to an embodiment of the invention.

Description of the invention

The present invention will be described in connection with particular embodiments and with reference to certain drawings, but the invention is not limited thereto. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes. In addition, the terms first, second, third and similar in the description and in the claims are used to distinguish between similar elements and not necessarily to describe a sequential or chronological order. The terms are interchangeable under appropriate circumstances, and the embodiments of the invention may operate in other sequences than those described or illustrated herein. In addition, the various embodiments, although referred to as "preferred", should be interpreted as exemplary ways in which the invention can be implemented rather than as limiting the scope of the invention. The term "comprising", as used in the claims, should not be construed as being limited to the means or steps listed thereafter; it does not exclude other elements or steps. It must be interpreted as specifying the presence of the elements, integers, steps or components referred to, but does not exclude the presence or addition of one or more other elements, integers, steps or components, or groups of these. Therefore, the scope of the expression "a device comprising A and B" should not be limited to devices comprising only components A and B; on the contrary, with respect to the present invention, the only listed components of the device are A and B, and the claim should also be interpreted as including equivalents of these components.

The term character, as used in this document, refers to a symbol or sign used in writing, such as a grapheme, a logogram, an alphabetical letter, a ligature, a numeric character or a punctuation mark. The terms "designed for", when they refer for example to an OCR engine or a classification designed for a type of characters, refer to the fact that the OCR engine or the classification has been optimized to be particularly fast and accurate in the identification or classification of this type of characters, which may be for example the Asian type of characters, the Latin type of characters, the Arabic type of characters, etc. An OCR engine or a classification designed for a type of characters uses a database containing character models of this type of characters. All character types can include punctuation, numeric characters and symbols. An OCR engine or a classification designed for a given type of characters may be able to identify or classify other types of characters, but an OCR engine designed for Asian characters is less accurate in recognizing Latin characters than an OCR engine designed for Latin characters. The terms identification and ID, as used herein, refer to the recognition of one or more characters in a machine-readable code to obtain a searchable character string. The identification or ID is a result of an OCR engine. The term classification, as used in this document, refers to the generation of a plurality of hypotheses about the identification of one or more characters.
Each identification hypothesis, or ID hypothesis, is associated with a probability that the identification hypothesis is correct, that is, that the image or segmentation portion subjected to the classification effectively represents the character or group of characters of the identification hypothesis. The classification can be done by a dedicated program called a classifier. A classification designed to classify characters one by one, or ligatures, is called individual character classification. The term "classify", as used herein, means "perform a classification". Even when the invention is described with embodiments comprising a classification of individual characters, it should be understood that the scope of the present invention extends to other types of classification, including classifications other than of individual characters. The terms hypotheses, ID hypotheses and identification hypotheses, as used in this document, refer to alternative solutions for the identification of a character or group of characters. One or more hypotheses are the result of the classification, and a decision has yet to be made regarding the result of the OCR engine. A hypothesis is a predefined model of a character or group of characters. Hypotheses can also be called possible solutions or candidates. The terms atom and blob, as used herein, refer to a part of an image which is made up of pixels of a given color that touch each other. For example, in a black and white image, an atom or blob is a set of black pixels connected to each other by black pixels. The term contextual decision, as used in this document, refers to a decision based on the context of a character to decide the identification of that character. For example, an entire word may be taken into consideration when deciding on the identification of each letter of the word.

Figure 1 shows a flowchart of an OCR method 100 according to the state of the art. An image 101 of a character string is taken as input by an OCR engine 102. The OCR engine 102 processes the information contained in the image and provides an ID 103 of the characters of the character string of the image 101 as a result.

Figure 2 shows a flow diagram of an OCR engine 102 according to the state of the art. The OCR engine 102 includes a step 201 that combines segmentation 202 and individual character classification 203. The segmentation 202 is a division of the image of the character string 101 into segmentation portions that possibly correspond to characters. A segmentation portion is a part of the image of the character string 101 that is subjected to processing to determine whether it represents a character, a group of characters, a shape, etc. Numerous alternative divisions of the image of the character string are typically taken into consideration during step 201 combining segmentation 202 and individual character classification 203. If the image of the character string 101 is an image of a line of characters, a segmentation portion is a part of that line of characters. Since a segmentation portion is a part of an image, a segmentation portion is itself an image. The classification of individual characters 203 generates, for a segmentation portion, one or more hypotheses with their associated probabilities. The individual character classification 203 typically calculates, from among a series of character patterns, the patterns that have the highest probabilities of matching the character represented on the segmentation portion.
The classification of individual characters 203 generates, in association with each ID hypothesis for a given segmentation portion, a probability that this ID hypothesis is correct, i.e., a probability that the segmentation portion actually represents this character. The probability may be, for example, a percentage or a likelihood weight. A character classification more general than the individual character classification can be used in the OCR engine 102; it can identify groups of characters (e.g. ligatures), shapes, logos, drawings, etc. The combination step 201 alternates between the segmentation 202 and the individual character classification 203 to generate a series of hypotheses 204A on the ID of the characters of the character string image 101 and associated probabilities 204B. The series of hypotheses 204A on the ID of the characters of the image of the character string 101, associated with their probabilities 204B, is then analyzed during a contextual decision step 205, which determines, among the hypotheses 204A on the ID of the characters, the hypothesis with the highest overall probability for the complete image of the character string 101. The hypothesis with the highest overall probability is identified as the character ID 103 and is the result of the OCR engine 102.

Figure 3 shows a flowchart of a step 390 combining segmentation and classification of individual characters according to one embodiment of the invention. A segmentation 350 generates a segmentation portion 351. In one embodiment of the present invention, the segmentation 350 is based on the detection of inter-character breaks or word breaks. In a further embodiment of the present invention, the segmentation 350 generates atoms, which are sets of pixels of a given color, and a segmentation portion comprises from one to five atoms. In yet another embodiment of the present invention, the segmentation 350 comprises the steps of:
• determining a first starting point coordinate of a pixel contrasting with the background,
• generating a list of potential character widths depending on a maximum character width and on characteristics of the part of the character string image corresponding to the maximum character width, and
• determining a second part of the character string image corresponding to the first starting point coordinate and to the first width on the list of potential character widths.

The segmentation portion 351 is then classified by an individual character classification step 300 according to one embodiment of the invention. In the individual character classification 300, the segmentation portion 351 is first analyzed by an individual character classification 301 designed for Asian characters, to generate one or more hypotheses 302A on the ID of the character represented by the segmentation portion 351 and associated probabilities 302B. In one embodiment of the invention, the probability is a number in the interval [0,1], where 1 indicates an excellent match between an ID hypothesis and the segmentation portion 351 and 0 a very poor match between an ID hypothesis and the segmentation portion 351. The one or more hypotheses 302A on the ID of the character and the associated probabilities 302B are a result of the combining step 390. In one embodiment of the invention, the classification of individual characters designed for Asian characters 301 does not include the possibility of recognizing Latin characters, and all the hypotheses 302A relate to Asian characters.
In one embodiment of the invention, the individual character classification for Asian characters 301 includes the ability to recognize Latin characters, and the hypotheses 302A relate to Asian or Latin characters. In one embodiment of the invention, the individual character classification for Asian characters 301 includes a feature extraction step that generates a feature vector. The feature extraction step uses a Gabor filter, which is a sinusoidal wave multiplied by a Gaussian function. The feature vector is used to generate the probabilities of character models from a predetermined list.

A verification step 303 checks whether the one or more hypotheses 302A on the character ID satisfy at least one condition in a list of one or more conditions. In one embodiment of the present invention, one of the conditions of the list is that all the probabilities 302B of the hypotheses 302A are below a given threshold. A high threshold means that the condition in the verification step 303 is easily satisfied and that many segmentation portions 351 will be analyzed by the second individual character classification 305 designed for Latin characters, as will be described later, which increases the accuracy of the overall OCR process but decreases its speed. A low threshold means that the condition in the verification step 303 is not easily satisfied and that few segmentation portions 351 will be analyzed by the individual character classification 305, which increases the speed of the process but decreases its accuracy. In one embodiment of the invention, the threshold is 0.7, which is a compromise between speed and accuracy.

In one embodiment of the present invention, one of the conditions of the list in step 303 is that at least one of the one or more hypotheses 302A relates to an Asian character known to resemble a Latin character. Figure 4 illustrates the resemblance between some Asian and Latin characters. The Asian character 工 501 (CJK Unified Ideograph 5DE5) resembles I 502 (the Latin capital letter i). The right-hand side of the Asian character 北 503 (CJK Unified Ideograph 5317) can be confused with a t 504. The Asian character 505 (CJK Unified Ideograph 52F2) resembles a 2 (number two) 506. The Asian character 许 507 (CJK Unified Ideograph 8BB8) can be confused with the letters i and F 508. The Asian character 이 509 (Hangul syllable i) can be confused with the letters o and l 510 (O and L in lower case). For example, the individual character classification 301 designed for Asian characters can provide, for a given segmentation portion 351, the Asian character 505 as a hypothesis with a high probability, even if the segmentation portion actually represents the character 2. It is interesting for the accuracy that such a segmentation portion be analyzed by a classification of individual characters for Latin characters.

In one embodiment of the present invention, where the individual character classification 301 designed for Asian characters includes the possibility of recognizing Latin characters, one of the conditions of the list is that a hypothesis among the hypotheses 302A relates to a Latin character. In one embodiment of the present invention, where the individual character classification 301 designed for Asian characters includes the possibility of recognizing Latin characters, one of the conditions of the list is that a hypothesis with a probability equal to or greater than a threshold among the hypotheses 302A relates to a Latin character.
In another embodiment of the invention, the threshold is equal to 50%. In one embodiment of the present invention, where the individual character classification 301 designed for Asian characters includes the possibility of recognizing Latin characters, one of the conditions of the list is that the hypothesis with the highest probability among the hypotheses 302A relates to a Latin character.

If at least one of the conditions of the list is satisfied at the verification step 303, the individual character classification 300 continues with an individual character classification 305 designed for Latin characters. The result of the individual character classification 305 designed for Latin characters is one or more hypotheses 306A on a Latin ID of the character on the segmentation portion 351, associated with weights 306B. In one embodiment of the invention, the weights 306B are numbers in the range [0,255], where the number 0 indicates an excellent match between an ID hypothesis and the segmentation portion 351 and the number 255 a very poor match between an ID hypothesis and the segmentation portion 351. In one embodiment of the present invention, a step 307 of scaling the weights is necessary to match the scale of the Latin weights W_Latin 306B of the Latin ID hypotheses 306A to the scale of the Asian probabilities 302B of the Asian ID hypotheses 302A. A transformed Latin probability P_Latin_transformed 308B is calculated from the Latin weight W_Latin 306B by a scaling formula. The weight scaling step 307 results in the Latin character ID hypotheses 308A on the segmentation portion, which are the same as the Latin ID hypotheses 306A but are now associated with the transformed Latin probabilities 308B, which can be compared directly to the probabilities 302B that the hypotheses 302A on the Asian ID of the character are correct. The one or more Latin hypotheses 308A on the ID of the character on the segmentation portion, with their associated probabilities 308B, are a result of the combining step 390. The location 312 of the segmentation portion 351 that has been subjected to the individual character classification 305 designed for Latin characters is another result of the combining step 390. The combining step 390 then uses the segmentation 350 to generate a next segmentation portion 351.

In one embodiment of the present invention, once the combining step 390 has been performed on a complete image of the character string to generate one or more hypotheses 302A on the Asian ID with their associated probabilities 302B and one or more hypotheses 308A on the Latin ID with their associated probabilities 308B on all the segmentation portions, a contextual decision step is performed to determine the combination of the hypotheses 302A and 308A that provides the identification of the string of characters. In one embodiment of the present invention, at least one of the individual character classifications 301 and 305 is performed by a classifier of individual characters. In one embodiment of the present invention, the individual character classifications 301 and 305 are extended to classify groups of characters. In one embodiment of the present invention, the individual character classifications 301 and 305 are extended to classify ligatures.
Even though Figure 3 describes an embodiment of the invention where the first classification 301 is designed for Asian character recognition and the second classification 305 is designed for Latin character recognition, the invention can be used for other types of characters such as Arabic characters, Cyrillic characters, Greek characters, Hebrew characters, hieroglyphs, etc.

Figure 5 illustrates a combined segmentation and classification of individual characters according to one embodiment of the present invention. The segmentation first divides the image of the character string 101 into four segmentation portions 602, 603, 604 and 605 to generate a first plurality 601 of segmentation portions. Then the classification of individual characters analyzes the first segmentation portion 602 and generates n602 hypotheses C602-1 to C602-n602, each hypothesis C602-i having an associated probability P602-i. Then the classification of individual characters analyzes the second segmentation portion 603 and generates n603 hypotheses C603-1 to C603-n603, each hypothesis C603-i having an associated probability P603-i. The classification of individual characters is repeated four times, since the first plurality 601 of segmentation portions contains four segmentation portions 602, 603, 604 and 605. Then, the segmentation divides the image of the character string 101 into six segmentation portions 607, 608, 609, 610, 611 and 612 to generate a second plurality 606 of segmentation portions. Then the classification of individual characters analyzes the first segmentation portion 607 and generates n607 hypotheses C607-1 to C607-n607, each hypothesis C607-i having an associated probability P607-i. Then the classification of individual characters analyzes the second segmentation portion 608 and generates n608 hypotheses C608-1 to C608-n608, each hypothesis C608-i having an associated probability P608-i. The classification of individual characters is repeated six times, since the second plurality 606 of segmentation portions contains six segmentation portions 607, 608, 609, 610, 611 and 612. The segmentation and the series of individual character classifications are repeated a number of times to provide hypotheses on the ID of the characters. In one embodiment of the present invention, the process illustrated in Figure 5 is used in combination with the flowchart of Figure 3.

Figure 6 shows a flowchart of an OCR method 400 according to one embodiment of the invention. An image 401 of a character string is taken as input by an OCR engine 402. In one embodiment of the invention, the image 401 of a character string is a horizontal line or part of a horizontal line. In another embodiment of the present invention, the image 401 of a character string is a vertical line or part of a vertical line. The first OCR engine 402 is designed for a first type of characters. In one embodiment of the invention, the first OCR engine 402 is designed for Asian characters. In one embodiment of the invention, the first OCR engine 402 is designed for Asian characters and includes the ability to recognize Latin characters. In one embodiment of the invention, the first OCR engine 402 uses a combining step 390 combining segmentation and classification of individual characters using two individual character classifiers as illustrated in Figure 3. In one embodiment of the invention, the first OCR engine 402 uses a classification of individual characters 300 as illustrated in Figure 3.
The first OCR engine 402 generates an ID 403 for the characters of the first type and determines areas 404 of characters of another type in the image 401 of the character string. In one embodiment of the invention, the areas 404 of characters of another type in the image 401 are areas where the first OCR engine 402 could not identify characters of the first type with a good level of confidence, the confidence level being judged according to a predefined criterion. In a further embodiment of the invention, the areas 404 of characters of another type in the image 401 are sets of contiguous segmentation portions. In another embodiment of the invention, the areas 404 of characters of another type in the image 401 are sets of non-contiguous segmentation portions. In one embodiment of the invention, the areas 404 of characters of another type in the image 401 contain at least two characters. In one embodiment of the invention, the areas 404 of characters of another type in the image 401 contain at least two contiguous characters, because the contextual decision step of the second OCR engine 405 is more accurate when it is done with at least two characters. In one embodiment of the invention, an area 404 of characters of another type in the image 401 is no longer than a line of text. In one embodiment of the invention, an area 404 of characters of another type in the image 401 is no longer than a text column. In one embodiment of the present invention, the areas 404 of characters of another type in the image 401 are chosen as large as possible, because the second OCR engine 405 is more accurate when working on large areas comprising many characters than on small areas with only one or a few characters. In an embodiment of the invention, the areas 404 of characters of another type in the image 401 are sets of segmentation portions, at least one segmentation portion satisfying at least one of the following conditions, which are checked by the first OCR engine 402:
• a character of a type of characters other than the first type of characters was detected on the segmentation portion by the first OCR engine 402;
• the probability of the identification on the segmentation portion by the first OCR engine 402 is below a threshold;
• the identification on the segmentation portion corresponds to a character of the first type of characters known to resemble a character of another type of characters; and
• a character of the second type of characters was identified on said segmentation portion by the first OCR engine.

The areas 404 of another type of characters are then processed by the second OCR engine 405. In one embodiment of the invention, the second OCR engine 405 is designed for a second type of characters. In one embodiment of the invention, the second OCR engine 405 is designed for Latin characters. In one embodiment of the invention, the second OCR engine 405 detects the language of the Latin characters. In one embodiment of the invention, the second OCR engine 405 uses English as the default assumption for the language of the Latin characters. In one embodiment of the invention, the second OCR engine 405 uses a predetermined language for the language of the Latin characters. In one embodiment of the invention, the second OCR engine 405 is a multilingual OCR. In one embodiment of the invention, the second OCR engine 405 uses a combining step 390 combining segmentation and classification of individual characters as illustrated in Figure 3.
In one embodiment of the invention, the second OCR engine 405 uses a classification of individual characters 300 as illustrated in Figure 3. In one embodiment of the invention, the second OCR engine 405 uses only one classifier of individual characters. The second OCR engine 405 generates an ID 406 for the characters of the second type. The result 407 of the OCR method 400 is a combination of the ID of the characters of the first type 403 and the ID of the characters of the second type 406, in an order corresponding to the position of the characters in the character string image 401. In an embodiment of the invention, at least one of the first OCR engine 402 and the second OCR engine 405 uses a pre-processing of the character string image 401 to achieve image enhancement. In one embodiment of the invention, at least one of the first OCR engine 402 and the second OCR engine 405 uses a binarization of the character string image 401 to separate the foreground and the background of the character string image 401.

In one embodiment of the present invention, contextual decisions based on decision graphs are used in at least one of the first OCR engine 402 and the second OCR engine 405. The result of the combination step combining segmentation and classification of individual characters of the considered OCR engine, which is a series of hypotheses on the ID of the characters and of probabilities associated with these hypotheses, is used to generate a first decision graph. In a decision graph, an arc is created for each hypothesis, associated with its segmentation portion and its probability. An arc for a segmentation portion begins at a node where the segmentation portions that terminate just before said segmentation portion on the character string image 401 end, and ends at a node where all the segmentation portions that begin immediately after said segmentation portion on the character string image 401 start. Several decision models are then used to improve the identification process. The decision models analyze the hypotheses emitted by the step combining segmentation and classification of individual characters, and produce additional identification probabilities that are added to the decision graph. The ID of the characters of the string image taken as the result of the considered OCR engine is determined by finding the path in the decision graph that maximizes the likelihood of identification for the complete string of characters.

A first type of decision model is a linguistic model. If the word "ornate" is present in the character string image to be identified, a combination step combining segmentation and classification of individual characters can find, for example, the word "ornate" and the word "omate" as hypotheses with similar probabilities, because the letters rn taken together resemble the letter m. A linguistic model using a dictionary is able to detect that the word "omate" does not exist, while the word "ornate" exists. In one embodiment of the present invention, the linguistic model uses an n-gram model. If the word "TRESMEUR" is present in the character string image 401, a combination step combining segmentation and classification of individual characters may find, for example, the word "TRESMEUR" and the word "TRE5MEUR" as hypotheses with similar probabilities, because the letter 'S' may look like the digit '5' in a printed text. A linguistic model using a bigram model (n-gram with n = 2) would prefer 'TRESMEUR' if 'ES' and 'SM' have better probabilities of occurrence than 'E5' and '5M'.
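To make the bigram preference concrete, here is a small illustrative sketch; the probability values are invented for the example and are not taken from the patent.

import math

# Hedged sketch of a bigram (n-gram, n = 2) linguistic model rescoring two
# competing hypotheses for the same word image, as in the TRESMEUR example.
BIGRAM_LOG_PROB = {            # assumed toy bigram log-probabilities
    "ES": math.log(0.020), "SM": math.log(0.010),
    "E5": math.log(0.0001), "5M": math.log(0.0001),
}
DEFAULT_LOG_PROB = math.log(0.005)   # assumed back-off value for unseen bigrams

def bigram_score(word):
    bigrams = (word[i:i + 2] for i in range(len(word) - 1))
    return sum(BIGRAM_LOG_PROB.get(bg, DEFAULT_LOG_PROB) for bg in bigrams)

# The linguistic score would be added to the classification score in the decision
# graph; here the classification probabilities of the two hypotheses are assumed
# equal, so the bigram model alone decides in favour of "TRESMEUR".
print(max(["TRESMEUR", "TRE5MEUR"], key=bigram_score))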
Another type of model used in one embodiment of the present invention is a typographic model. If the word "Toguivy" is present in the character string image 401, a step combining segmentation and classification of individual characters can find, for example, the word "Toguivy" and the word "Toguivv" as hypotheses with similar probabilities, because the letter 'y' may look like the letter 'v' in a printed text. A typographic model using font metrics would prefer "Toguivy" because the bottom position of the final character is more likely to be the bottom position of a 'y' (in its model) than of a 'v'. In one embodiment of the present invention, the typographic model considers the position of the character in the image to check whether its size and position are those expected or not.

These examples of linguistic and typographic models clearly show why it is advantageous that the zones 404 of characters of another type in the image 401 be as large as possible and contain at least two contiguous characters, since the contextual models are more accurate when they work on several characters.

In one embodiment of the present invention, an OCR engine designed for Asian characters takes a line of characters of the character string image as its input, and a decision graph is generated for each line of characters. In one embodiment of the present invention, an OCR engine designed for Latin characters takes a word as the character string image for its input, and a decision graph is generated for each word.

In one embodiment of the present invention, a second OCR engine determines zones of a third type of characters in the character string image 401, and a third OCR engine, designed for the third type of characters, processes the zones of the third type of characters. In a further embodiment of the invention, several types of characters are handled in a similar way, with OCR engines designed for the different types of characters working one after the other, in cascade, or in parallel with each other; depending on whether given conditions are met, the overall OCR method will use more or fewer OCR engines to identify the characters.

An OCR method according to an embodiment of the present invention can be described as follows. The image of a line of characters, which is expected to contain mostly Asian characters, is analyzed by a first OCR engine. The segmentation of the first OCR engine is a segmentation designed for Asian characters. It divides the image of a line of characters into atoms and then generates segmentation portions that comprise between one and five atoms. A segmentation portion is first analyzed by a classification of individual characters designed for Asian characters but capable of classifying Latin characters, to generate one or several Asian hypotheses on said segmentation portion, with their associated probabilities. Probabilities are numbers in the range [0, 1]. The one or several Asian hypotheses are checked against three conditions. If none of the conditions is satisfied, the one or several Asian hypotheses, with their associated probabilities, are taken as the sole result of the classification of individual characters. The conditions, a check of which is sketched just after this list, are:
• all the probabilities associated with the Asian hypotheses are below 0.7;
• one of the Asian hypotheses is a Latin character with a probability of at least 0.5;
• one of the Asian hypotheses is an Asian character that is known to look like a Latin character.
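The three conditions above can be expressed compactly. The following is a minimal sketch under assumed data structures: the hypothesis tuples, the `is_latin` flag and the set of Asian/Latin look-alike characters are illustrative placeholders, not the patent's actual tables; only the 0.7 and 0.5 thresholds come from the text above.

```python
# Minimal sketch of checking the three conditions on the Asian hypotheses of one segmentation portion.
from typing import List, Tuple

Hypothesis = Tuple[str, float, bool]  # (character, probability in [0, 1], is_latin)

# Asian characters assumed to resemble Latin characters (illustrative examples only).
ASIAN_LATIN_LOOKALIKES = {"一", "口", "二"}

def must_run_latin_classifier(asian_hypotheses: List[Hypothesis]) -> bool:
    """Return True if at least one of the three conditions is satisfied."""
    all_below_07 = all(p < 0.7 for _, p, _ in asian_hypotheses)
    latin_above_05 = any(is_latin and p >= 0.5 for _, p, is_latin in asian_hypotheses)
    lookalike = any(char in ASIAN_LATIN_LOOKALIKES for char, _, _ in asian_hypotheses)
    return all_below_07 or latin_above_05 or lookalike

# Example: every Asian hypothesis is weak, so the Latin classifier is also applied.
print(must_run_latin_classifier([("夕", 0.62, False), ("タ", 0.55, False)]))  # True
```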
If at least one of these conditions is satisfied, the segmentation portion is then analyzed by a classification of individual characters designed for Latin characters, to generate one or several Latin hypotheses on said segmentation portion, with their associated weights. The weights, which are numbers in the range [0, 255], are transformed according to a formula to generate probabilities that can be directly compared with the probabilities of the Asian hypotheses. The results of the classification of individual characters are then (1) the one or several Asian hypotheses, with their associated probabilities; (2) the one or several Latin hypotheses, with their associated probabilities; and (3) the locations of the one or several Latin hypotheses. The process is then repeated on the next segmentation portion.

Once all segmentation portions of the string image have been passed through the classifier of individual Asian characters, and possibly through the classifier of individual Latin characters, all the Asian hypotheses with their probabilities and all the Latin hypotheses with their probabilities are used to generate a first decision graph. A contextual decision designed for Asian characters, but capable of handling Latin characters, is then applied to generate a second decision graph. The path along the second decision graph that maximizes the probability of identification of the complete string of characters is taken as the result of the first OCR engine. This path contains Asian characters and may also contain Latin characters.

If this path contains Latin characters, that is, if Latin characters have been identified by the first OCR engine, the locations of the zones of these Latin characters are passed to a second OCR engine that is designed for Latin characters. Indeed, since these zones contain Latin characters, it is worthwhile, in order to improve accuracy, to apply a complete OCR engine designed for Latin characters, including a contextual decision designed for Latin characters, which has not been done so far. A zone is defined as the largest area of the character line image that contains only Latin characters. A zone contains at least two characters. It is preferable that these zones be as large as possible, because the more characters the contextual decision methods analyze at one time, the more accurate these methods are. The second OCR engine identifies the characters in these zones.

The Asian characters identified by the first OCR engine and the Latin characters identified by the second OCR engine are then ordered according to their location on the string image to obtain the identification of the characters in the character string image.
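Two steps of this description lend themselves to a short illustration: making the Latin weights comparable to the Asian probabilities, and the final ordering by location. The exact transformation formula is not reproduced in the text above, so the sketch below uses a simple linear rescaling of the [0, 255] weight range purely as a stand-in; the merge function and the sample characters are likewise illustrative assumptions.

```python
# Minimal sketch of a stand-in weight-to-probability rescaling and of the final ordering step.
from typing import List, Tuple

def weight_to_probability(weight: int) -> float:
    """Map a Latin-classifier weight in [0, 255] to a stand-in probability in [0, 1]."""
    return max(0, min(weight, 255)) / 255.0

def merge_by_location(asian: List[Tuple[int, str]],
                      latin: List[Tuple[int, str]]) -> str:
    """Order the characters from both engines by their location on the line image."""
    return "".join(char for _, char in sorted(asian + latin, key=lambda item: item[0]))

# Example: two Asian characters around a Latin zone recognized by the second engine.
print(weight_to_probability(204))                                        # 0.8
print(merge_by_location([(0, "領"), (40, "書")], [(10, "A"), (20, "B"), (30, "C")]))
```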
Claims (28)

[1] A method for identifying characters in a character string image 401, the method comprising: (i) applying a first OCR engine 402 to provide a character identification 403 of at least a first type of characters and zones 404 of at least a second type of characters in the character string image 401, (ii) applying, on the zones 404 of the at least second type of characters, a second OCR engine 405 to provide a character identification 406 of a second type of characters, and (iii) combining the characters identified by the first OCR engine 402 and by the second OCR engine 405 to obtain the identification of the characters of the character string image 401, wherein the first OCR engine 402 comprises a segmentation 350 of the character string image 401 into segmentation portions 351 and comprises, for each segmentation portion 351, the steps of (a) applying a first character classification 301 to provide a first plurality of hypotheses 302A on at least one character represented by the segmentation portion 351 and a first plurality of probabilities 302B associated with the hypotheses 302A, (b) checking 303 whether the first plurality of hypotheses 302A satisfies at least one condition, and (c) applying, if at least one condition is satisfied, a second character classification 305 to provide a second plurality of hypotheses 308A on the at least one character represented by the segmentation portion 351 and a second plurality of probabilities 308B associated with the hypotheses 308A.

[2] The method of claim 1, wherein the first OCR engine 402 uses a character database comprising characters of the first type of characters.

[3] The method of claim 2, wherein the first character classification 301 uses a character database comprising characters of the first type of characters.

[4] The method of claim 3, wherein the first character classification 301 is capable of detecting characters of at least one type of characters other than the first type of characters.

[5] The method of claim 4, wherein the first type of characters is an Asian type of characters.

[6] The method of claim 4, wherein the second OCR engine 405 uses a character database comprising characters of the second type of characters.

[7] The method of claim 6, wherein the second character classification 305 uses a character database comprising characters of the second type of characters.

[8] The method of claim 7, wherein the second type of characters is a Latin type of characters.

[9] The method of claim 3, wherein the first type of characters is the Latin type of characters, the Arabic type of characters, the Hebrew type of characters, the Cyrillic type of characters, the Greek type of characters or the hieroglyph type of characters.

[10] The method of claim 7, wherein the second type of characters is the Asian type of characters, the Arabic type of characters, the Hebrew type of characters, the Cyrillic type of characters, the Greek type of characters or the hieroglyph type of characters.

[11] The method of claim 1, wherein at least one of the first character classification 301 and the second character classification 305 is a classification of individual characters.
[12] The method of claim 1, wherein the zones 404 of the at least second type of characters are groups of segmentation portions 351 in which certain segmentation portions 351 satisfy at least one of the following conditions:
• all the probabilities 302B of the hypotheses 302A provided by the first character classification for said segmentation portion 351 are below a given threshold;
• a hypothesis among the hypotheses 302A for said segmentation portion 351 relates to a character of the first type of characters known to resemble a character of the at least second type of characters;
• a hypothesis among the hypotheses 302A for said segmentation portion 351 concerns a type of characters other than the first type of characters; and
• a character of the second type of characters has been identified on said segmentation portion by the first OCR engine 402.

[13] The method of claim 1, wherein a zone 404 of the at least second type of characters is a group of segmentation portions 351 which comprises at least two characters.

[14] The method of claim 4, wherein said at least one condition is one of the following conditions:
• all the probabilities 302B of the hypotheses 302A are below a given threshold;
• a hypothesis among the hypotheses 302A relates to a character of the first type of characters known to resemble a character of the at least second type of characters; and
• a hypothesis with a probability greater than a given threshold among the hypotheses 302A relates to a type of characters other than the first type of characters.

[15] The method of claim 1, wherein segmenting 350 the character string image into segmentation portions 351 comprises the steps of:
• determining a first start point coordinate of a pixel contrasting with the background,
• generating a list of potential character widths depending on a maximum character width and on characteristics of the part of the character string image corresponding to the maximum character width, and
• determining a second part of the character string image corresponding to the first start point coordinate and to the first width in the list of potential character widths.

[16] The method of claim 1, wherein: the first plurality of probabilities 302B corresponds to a first probability scale, the second plurality of probabilities 308B corresponds to a second probability scale, and the method comprises a step 307 of transforming at least one of the first plurality of probabilities 302B and the second plurality of probabilities 308B to rescale the first or second probability scale such that the first plurality of probabilities 302B and the second plurality of probabilities 308B can be compared, to obtain a first or second plurality of transformed probabilities.

[17] The method of claim 1, wherein the first OCR engine 402 includes a step of making a contextual decision for identifying the characters of the at least first type of characters based on the hypotheses 302A of the first character classification 301 with their corresponding probabilities 302B and on the hypotheses 308A of the second character classification 305 with their corresponding probabilities 308B for all the segmentation portions 351.

[18] The method of claim 17, wherein the step of making a contextual decision uses at least one of the following decision support tools: a decision graph, a linguistic model, a typographic model, and a decision model based on an n-grams model.
[19] A computer program product comprising a medium that can be used by a computer and in which control logic is stored to cause a computing device to identify characters in a character string image 401, the control logic comprising: (i) first computer-readable program code means for applying a first OCR engine 402 to provide a character identification 403 of at least a first type of characters and zones 404 of at least a second type of characters in the character string image 401; (ii) second computer-readable program code means for applying, to the zones 404 of the at least second type of characters, a second OCR engine 405 to provide a character identification 406 of a second type of characters, and (iii) third computer-readable program code means for combining the characters identified by the first OCR engine 402 and the second OCR engine 405 to obtain the identification of the characters of the character string image 401, wherein the first OCR engine 402 includes a segmentation 350 of the character string image 401 into segmentation portions 351 and comprises, for each segmentation portion 351, the steps of (a) applying a first character classification 301 to provide a first plurality of hypotheses 302A on at least one character represented by the segmentation portion 351 and a first plurality of probabilities 302B associated with the hypotheses 302A, (b) checking 303 whether the first plurality of hypotheses 302A meets at least one condition, and (c) applying, if at least one condition is satisfied, a second character classification 305 to provide a second plurality of hypotheses 308A on the at least one character represented by the segmentation portion 351 and a second plurality of probabilities 308B associated with the hypotheses 308A.

[20] A method for identifying characters in a character string image 401, the method comprising: (i) applying an OCR engine designed for Asian characters to provide an identification of Asian characters and zones of non-Asian characters
in the character string image 401, (ii) applying, on the zones of non-Asian characters, an OCR engine designed for Latin characters to provide an identification of Latin characters, and (iii) combining the characters identified by the OCR engine designed for Asian characters and by the OCR engine designed for Latin characters to obtain the identification of the Asian and Latin characters of the character string image 401, wherein the OCR engine designed for Asian characters comprises the steps of: (A) segmenting the character string image 401 into segmentation portions 351, (B) applying, for each segmentation portion 351, a classification of individual characters designed for the Asian characters 301 to provide a first plurality of hypotheses 302A on at least one character represented by the segmentation portion 351 and a first plurality of probabilities 302B associated with the hypotheses 302A, and (C) making a contextual decision for the identification of at least the Asian characters based on the hypotheses 302A of the classification of individual characters designed for the Asian characters 301 and their corresponding probabilities 302B for all the segmentation portions 351.

[21] The method of claim 20, wherein step (B) further comprises, for each segmentation portion 351, the substeps of:
• checking 303 whether the first plurality of hypotheses 302A satisfies at least one condition, and
• applying, if at least one condition is satisfied, a second character classification designed for the Latin characters 305 to provide a second plurality of hypotheses 308A on the at least one character represented by the segmentation portion 351 and a second plurality of probabilities 308B associated with the hypotheses 308A,
and wherein the contextual decision of step (C) is a contextual decision for character identification based on the hypotheses 302A of the classification of individual characters designed for the Asian characters 301 and their corresponding probabilities 302B and on the hypotheses 308A of the character classification designed for the Latin characters 305 and their corresponding probabilities 308B for all the segmentation portions 351.

[22] A computer program product comprising a medium that can be used by a computer and in which control logic is stored to cause a computing device to identify characters in a character string image 401, the control logic comprising: (i) first computer-readable program code means for applying an OCR engine designed for Asian characters to provide an identification of Asian characters and zones of non-Asian characters in the character string image 401, (ii) second computer-readable program code means for applying, to the zones of non-Asian characters, an OCR engine designed for Latin characters to provide an identification of Latin characters, and (iii) third computer-readable program code means for combining the characters identified by the OCR engine designed for Asian characters and by the OCR engine designed for Latin characters to obtain the identification of the Asian and Latin characters of the character string image 401,
wherein the OCR engine designed for Asian characters comprises the steps of: (A) segmenting the character string image 401 into segmentation portions 351, (B) applying, for each segmentation portion 351, a classification of individual characters designed for the Asian characters 301 to provide a first plurality of hypotheses 302A on at least one character represented by the segmentation portion 351 and a first plurality of probabilities 302B associated with the hypotheses 302A, and (C) making a contextual decision for the identification of at least the Asian characters based on the hypotheses 302A of the classification of individual characters designed for the Asian characters 301 and their corresponding probabilities 302B for all the segmentation portions 351.

[23] A method of identifying characters in a character string image 401, the method comprising the steps of: (A) segmenting the character string image 401 into segmentation portions 351, (B) for each segmentation portion 351, the substeps of: (a) applying a first character classification 301 to provide a first plurality of hypotheses 302A on at least one character represented by the segmentation portion 351 and a first plurality of probabilities 302B associated with the hypotheses 302A, (b) checking 303 whether the first plurality of hypotheses 302A satisfies at least one condition, and (c) applying, if at least one condition is satisfied, a second character classification 305 to provide a second plurality of hypotheses 308A on the at least one character represented by the segmentation portion 351 and a second plurality of probabilities 308B associated with the hypotheses 308A, and (C) making a contextual decision for the identification of the characters of the at least first type of characters on the basis of the hypotheses 302A of the first character classification 301 with their corresponding probabilities 302B and of the hypotheses 308A of the second character classification 305 with their probabilities 308B for all the segmentation portions 351.
[24] A computer program product comprising a medium that can be used by a computer and in which control logic is stored to cause a computing device to identify characters in a character string image 401, the control logic comprising: (A) first computer-readable program code means for segmenting 350 the character string image 401 into segmentation portions 351, (B) second computer-readable program code means for applying, for each segmentation portion 351, the substeps of: (a) a first character classification 301 to provide a first plurality of hypotheses 302A on at least one character represented by the segmentation portion 351 and a first plurality of probabilities 302B associated with the hypotheses 302A, (b) a check 303 of whether the first plurality of hypotheses 302A satisfies at least one condition, and (c) a second character classification 305, if at least one condition is satisfied, to provide a second plurality of hypotheses 308A on the at least one character represented by the segmentation portion 351 and a second plurality of probabilities 308B associated with the hypotheses 308A, and (C) third computer-readable program code means for making a contextual decision for identifying the characters of the at least first type of characters based on the hypotheses 302A of the first character classification 301 with their corresponding probabilities 302B and the hypotheses 308A of the second character classification 305 with their probabilities 308B for all the segmentation portions 351.

[25] A method for identifying characters in a character string image 401, the method comprising: (i) applying a first OCR engine 402 to provide a character identification 403 of at least a first type of characters and zones 404 of at least a second type of characters in the character string image 401, (ii) applying, on the zones 404 of the at least second type of characters, a second OCR engine 405 to provide a character identification 406 of a second type of characters, and (iii) combining the characters identified by the first OCR engine 402 and by the second OCR engine 405 to obtain the identification of the characters of the character string image 401.

[26] The method of claim 25, wherein the first OCR engine 402 comprises at least one of the following: (a) the use of a character database comprising characters of the first type of characters; (b) a segmentation designed for the first type of characters; (c) a character classification designed for the first type of characters; and (d) a contextual decision designed for the first type of characters.

[27] The method of claim 25, wherein the second OCR engine 405 comprises at least one of the following: (a) the use of a character database comprising characters of the second type of characters; (b) a segmentation designed for the second type of characters; (c) a character classification designed for the second type of characters; and (d) a contextual decision designed for the second type of characters.
[28] A computer program product comprising a medium that can be used by a computer and in which control logic is stored to cause a computing device to identify characters in a character string image 401, the control logic comprising: (i) first computer-readable program code means for applying a first OCR engine 402 to provide a character identification 403 of at least a first type of characters and zones 404 of at least a second type of characters in the character string image 401, (ii) second computer-readable program code means for applying, to the zones 404 of the at least second type of characters, a second OCR engine 405 to provide a character identification 406 of a second type of characters, and (iii) third computer-readable program code means for combining the characters identified by the first OCR engine 402 and the second OCR engine 405 to obtain the identification of the characters of the character string image 401.